Automatic Abstracting of Textual Material
Authors
Abstract
Our initial efforts at automatic abstracting began in 1969 as part of a more general question answering system (Tharp, 1969; Tharp and Krulee, 1969) which made use of short stories about famous discoveries taken from a children's encyclopedia. The original text was mapped into a set of predicates representing the logical content of the story. Although the main emphasis in this system was on question answering, there were two questions that could be asked that made use of a summarizing and abstracting capability. These were: "Who is the central character in the story?" and "What is the main theme of the story?"

In our current efforts (Taylor, 1975), we have extended Tharp's methods in order to deal more explicitly with the problem of automatic abstracting. The system represents the meaning of a text in terms of the semantic networks of Simmons (1973) which are based on the case grammar relationships of Fillmore (1968). In this form of representation, the nodes are words or concepts while the arcs represent a case grammar relationship that exists between a pair of nodes. Using the graphical techniques of Ramamoorthy (1966), the system identifies a portion of the original network, namely, the maximally connected subgraph (or graphs). Then using the techniques of signal flow graph analysis, the system identifies nodes that are most influential within the maximally connected subgraph. By proceeding iteratively, using this pair of techniques, a subgraph is obtained which serves as an abstract of the original text.
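The reduction just described can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual method: the triples are invented, "maximally connected subgraph" is read as an ordinary connected component of the undirected network, and node degree stands in for the signal flow graph influence measure, which the text does not specify. The sketch performs a single reduction pass rather than the iterative procedure.

```python
from collections import defaultdict

# Hypothetical case-grammar triples (head, case relation, dependent); not from the paper.
TRIPLES = [
    ("discover", "AGENT", "Fleming"),
    ("discover", "OBJECT", "penicillin"),
    ("discover", "TIME", "1928"),
    ("kill", "AGENT", "penicillin"),
    ("kill", "OBJECT", "bacteria"),
    ("rain", "LOCATION", "London"),
]


def build_adjacency(triples):
    """Undirected adjacency map over the nodes named in the triples."""
    adj = defaultdict(set)
    for head, _relation, dependent in triples:
        adj[head].add(dependent)
        adj[dependent].add(head)
    return adj


def connected_components(adj):
    """Return the connected components as node sets, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


def most_influential(adj, component, keep):
    """Rank nodes by degree within the component (assumed stand-in for influence)."""
    degree = {node: len(adj[node] & component) for node in component}
    ranked = sorted(component, key=lambda node: (-degree[node], node))
    return set(ranked[:keep])


def reduce_network(triples, keep=3):
    """One reduction pass: largest component, then its `keep` highest-degree nodes."""
    adj = build_adjacency(triples)
    largest = connected_components(adj)[0]
    kept = most_influential(adj, largest, keep)
    return [t for t in triples if t[0] in kept and t[2] in kept]


if __name__ == "__main__":
    for triple in reduce_network(TRIPLES, keep=3):
        print(triple)
```

With the invented triples, the isolated "rain in London" fragment is dropped with the smaller component, and only the arcs joining the three best-connected nodes survive.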
As a final step, again using a technique due to Simmons and Slocum (1972), the subgraph is converted back into a set of natural language sentences as the final output.

The results of these initial attempts are reasonably encouraging although certain practical difficulties have been encountered. For example, even a short sample of text (one or two pages) leads to the formation of a complex network that is difficult to store. However, computer time for the processing of these networks is not excessive (less than ten seconds on a CDC 6400) and limitations of space are more serious than of computer time.

During the past year, we have introduced a series of modifications into the original system (Lindner, 1976). One of these has to do with texts that have multiple themes, such as a main theme and some subsidiary themes. In our original system, the first part of the process leads to the identification of several maximally connected subgraphs (MCS's). However, only one of these, the largest, is chosen as the basis for developing an abstract. This subgraph usually does contain the main theme plus some related details. A secondary theme may well be contained in a second MCS. Thus, by choosing only a single MCS, one biases the system towards the development of an abstract that overemphasizes details relating to the main theme while ignoring secondary themes. Accordingly, we experimented with longer multiple-theme texts and with a procedure that would select the largest MCS plus one or more additional MCS's. The resulting abstract appears to be much improved.
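A selection procedure of this kind might look like the sketch below. The text does not say how the additional MCS's were chosen, so the size-ratio threshold and the cap on extra components (`min_ratio`, `max_extra`) are assumptions made purely for illustration, as are the example node sets.

```python
def select_components(components, max_extra=2, min_ratio=0.5):
    """Keep the largest component plus up to `max_extra` others that are
    at least `min_ratio` times its size (both parameters are assumed)."""
    ranked = sorted(components, key=len, reverse=True)
    if not ranked:
        return []
    largest = ranked[0]
    extras = [c for c in ranked[1:] if len(c) >= min_ratio * len(largest)]
    return [largest] + extras[:max_extra]


if __name__ == "__main__":
    # Hypothetical components for a multiple-theme text.
    components = [
        {"author", "born", "Dublin", "wrote", "novels"},
        {"hero", "leaves", "village", "returns"},
        {"exile", "memory", "loss"},
        {"publisher"},
    ]
    for component in select_components(components):
        print(sorted(component))
```

Here the one-node fragment is discarded while the three larger components are all retained, so secondary themes are no longer lost to the single largest component.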
For example, using as a text a book review, three important MCS's were identified, the first concerned primarily with a discussion of the author and the second and third with the main character of the book and the theme of the novel, respectively. Thus, an abstract making use of all three leads clearly to a much more balanced presentation than does an abstract of equal length that refers only to the author.

Our original program was unable to handle networks of more than 300 nodes. Basically, the revised technique deals with the text, paragraph by paragraph. As a first step, the network associated with each paragraph is reduced, one at a time, thus obtaining a sequence of reduced networks, one from each paragraph. These networks are recombined into a single network and converted back into an abstract. The abstracts obtained are not unlike those obtained through the use of the original method. Most importantly, with this modification, it is the length of each paragraph that is critical rather than the length of the text as a whole, although, as the number of paragraphs increases, one again runs into difficulties in storing the reduced networks for each of a series of paragraphs.

We now find ourselves faced with the problem of introducing modifications that would lead to a significant improvement in the capabilities of this abstracting system. We propose to introduce some radical changes of which the following are perhaps the most significant.

1. In many respects, networks representative of the meaning of a text make use of logical predicates as the basic semantic units and the higher level semantic unit that is computed is the product of a "bottom-up" form of analysis with the propositions being related in pairs until the overall network emerges as a final product. As an alternative, we propose that the higher level semantic analysis should be "top-down" in the sense of making predictions about the overall thematic structure of the material being processed. Moreover, this analysis should make use of a higher level or semantic grammar much like the thematic grammar that Rumelhart (1975) has proposed for the thematic analysis of children's stories.

2. Secondly, we are assuming that abstracting is normally a dynamic process that should be primarily logical or qualitative in form and should make use of what we might refer to as substitution or condensing operators. Formally, these operators might resemble an axiom or theorem in mathematics, stating that certain content can be substituted for certain other content. These operators will "condense" in the sense of taking a network representing a portion of text and replacing it with a simplified network or perhaps a single node. Sometimes, the basis for condensing the text is made explicit in the text, in which case the operator is not unlike a procedure for identifying certain types of phrases in context.
For example, one often encounters sentences in the form: "There are two main methods for the production of ____." or "When constructing a ____, one immediately faces three types of problems." Thus, in the abstract, one wants to identify in one case the two main methods and in the other case the three types of problems. Moreover, one probably wants to name the methods or problems while ignoring all of the amplifying details. Unfortunately, under many circumstances, such strong clues may not be given explicitly and must be inferred in much the same way that answers not explicitly contained in a data base can be inferred by problem solving or theorem proving methods. Thus, in our revised system we want to include a capability for "proving" that a set of summarizing statements can be inferred from the original data base.

In short, we are proposing two major modifications to our present abstracting program in order to make the system perform in a more "human-like" fashion and in order to develop a system with a significantly improved level of competence.
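The explicit-cue case, where the text itself announces "two main methods" or "three types of problems", could be approximated by a simple pattern matcher such as the one sketched here. The regular expression, the list of number words, and the words "steel" and "bridge" used to fill the blanks are all illustrative assumptions; the proposed operators would work on the semantic network rather than on raw strings.

```python
import re

# Number words the sketch recognises (an assumed, deliberately short list).
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5}

# Matches phrases such as "two main methods" or "three types of problems".
CUE = re.compile(
    r"\b(?P<count>two|three|four|five)\s+"
    r"(?:main\s+|major\s+)?"
    r"(?:types\s+of\s+|kinds\s+of\s+)?"
    r"(?P<noun>[a-z]+)",
    re.IGNORECASE,
)


def find_enumeration_cues(sentence):
    """Return (count, noun) pairs for each enumeration cue in the sentence."""
    return [
        (NUMBER_WORDS[m.group("count").lower()], m.group("noun").lower())
        for m in CUE.finditer(sentence)
    ]


if __name__ == "__main__":
    # The blanks in the quoted patterns are filled with hypothetical words.
    print(find_enumeration_cues(
        "There are two main methods for the production of steel."))
    print(find_enumeration_cues(
        "When constructing a bridge, one immediately faces three types of problems."))
```

A cue of this kind only tells the abstractor how many items to look for and what they are called; naming the individual methods or problems still requires analysis of the sentences that follow.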
Similar resources
Documentary Abstracting: Toward a Methodological Model
In the general abstracting process (GAP), there are two types of data: textual, within a particularly framed trilogy (surface, deep, and rhetoric); and documentary (abstractor, means of production and user demands). For its development, the use of the following disciplines, among others, is proposed: linguistics (structural, transformational and textual), logic (formal and fuzzy) and psychology...
Adding Semantic Metadata to Audio-video Material by Automatic Analysis of Complementary Sources
We present in this paper actual work on adding semantic metadata to multimedia material, on the base of the results of the automatic analysis applied to associated language material, being speech transcripts or various types of textual documents related to video/image material
Automatic Comprehension of Textual User Requirements and their Static and Dynamic Modeling
Requirements engineering is the most important activity in software engineering, and is concerned with the gathering and understanding of user requirements written in natural language (NL). There is a gap between the textual description of the software to be developed, and the software UML models abstracting the static or dynamic views of the software. Our research is aimed at filling this gap ...
A survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
Using Linguistic Knowledge in Automatic Abstracting
We present work on the automatic generation of short indicative-informative abstracts of scientific and technical articles. The indicative part of the abstract identifies the topics of the document while the informative part of the abstract elaborates some topics according to the reader's interest by motivating the topics, describing entities and defining concepts. We have defined our method of ...
Where Does Information Come From? Corpus Analysis for Automatic Abstracting
We report on our study of a corpus of abstracts and parent documents to determine which structural parts of the parent document are used to extract useful information for an abstract. The results give us a sound basis for automatic abstracting of research articles. Our method for automatic abstracting, called selective analysis, is intended to produce user-oriented abstracts which are indicat...
Journal title:
Volume, issue:
Pages: -
Publication date: 1977